{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cb33810e-c6be-48cc-84a3-a79794c6a4b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copyright 2021 NVIDIA Corporation. All Rights Reserved.\n",
    "#\n",
    "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "# you may not use this file except in compliance with the License.\n",
    "# You may obtain a copy of the License at\n",
    "#\n",
    "#     http://www.apache.org/licenses/LICENSE-2.0\n",
    "#\n",
    "# Unless required by applicable law or agreed to in writing, software\n",
    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "# See the License for the specific language governing permissions and\n",
    "# limitations under the License.\n",
    "# =============================================================================="
   ]
  },
  {
   "cell_type": "markdown",
   "id": "104213a3-18c3-4384-b923-59e35a163093",
   "metadata": {},
   "source": [
    "<img src=\"http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png\" style=\"width: 90px; float: right;\">\n",
    "\n",
    "# BERT QA Inference on TensorRT INT8: Quantization Aware Training (QAT) and Structured Sparsity\n",
    "\n",
    "This notebook demonstrates the use of BERT model with TensorRT in QAT INT8 and structured sparsity mode. These are two new features introduced since TensorRT 8.\n",
    "\n",
    "**Quantization Aware Training**: Using INT8 precision with quantization scales obtained from Post-Training Quantization (PTQ) can produce additional performance gains, but may also result in accuracy loss. Alternatively, for PyTorch-trained models, NVIDIA PyTorch-Quantization [toolkit](https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/index.html) can be leveraged to perform quantized fine tuning (a.k.a. Quantization Aware Training or QAT) and generate the INT8 quantization scales as part of training. This generally results in higher accuracy compared to PTQ.\n",
    "\n",
    "**Structured Sparsity**: Fine-grained 2:4 structured sparsity support introduced in NVIDIA Ampere GPUs can produce significant performance gains in BERT inference. The network is first trained using dense weights, then fine-grained structured pruning is applied, and finally the remaining non-zero weights are fine-tuned with additional training steps. This method results in virtually no loss in inferencing accuracy.\n",
    "\n",
    "For more information on sparsity and how to train sparse models, see the GTC [talk](https://gtc21.event.nvidia.com/media/Making%20the%20Most%20of%20Structured%20Sparsity%20in%20the%20NVIDIA%20Ampere%20Architecture%20%5BS31552%5D/1_0j8hi0r7) titled \"Making the Most of Structured Sparsity in the NVIDIA Ampere Architecture.\"\n",
    "\n",
    "TensorRT since version 8 supports both QAT and structured-sparsity trained networks. \n",
    "\n",
    "## Pre-requisite\n",
    "Follow the instruction at https://github.com/NVIDIA/TensorRT to build the TensorRT-OSS docker container required to run this notebook.\n",
    "\n",
    "## Content\n",
    "1. [Download data and model](#1)\n",
    "1. [Building a INT8-Sparsity TensorRT optimized BERT model](#2)\n",
    "1. [Running inference examples](#3)\n",
    "1. [Inference benchmarking](#4)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "99997954-bb39-42ba-8247-c2e47d213b29",
   "metadata": {},
   "source": [
    "<a id=\"1\"></a>\n",
    "\n",
    "## Download data and model\n",
    "First, we download the \n",
    "Stanford Question Answering Dataset ([SQuAD](https://rajpurkar.github.io/SQuAD-explorer/)) dataset and a pre-trained BERT QA model from NVIDIA GPU Cloud ([NGC](https://ngc.nvidia.com/catalog/models/nvidia:bert_pyt_ckpt_base_qa_squad11_amp)).\n",
    "### SQUAD dataset\n",
    "\n",
    "Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "93cfa328-1d7e-49ea-aa4e-23005f30b460",
   "metadata": {},
   "outputs": [],
   "source": [
    "!bash ../scripts/download_squad.sh"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6a699716-e432-4520-8b8e-fc12e039f3b5",
   "metadata": {
    "jupyter": {
     "outputs_hidden": true
    },
    "tags": []
   },
   "source": [
    "### Fine-tuned BERT Large Model download\n",
    "\n",
    "Many AI applications have common needs: classification, object detection, language translation, text-to-speech, recommender engines, sentiment analysis, and more. When developing applications with these capabilities, it is much faster to start with a model that is pre-trained and then tune it for a specific use case. The NGC [catalog](https://ngc.nvidia.com/catalog/models) offers pre-trained models for a variety of common AI tasks that are optimized for NVIDIA Tensor Core GPUs, and can be easily re-trained by updating just a few layers, saving valuable time.\n",
    "\n",
    "Herein, we download a pretrained, fine-tuned BERT large model, trained with automatic mixed precision, from NGC.\n",
    "\n",
    "To demonstrate the potential speedups from these optimizations in demoBERT, we provide the Megatron-LM transformer model finetuned for SQuAD 2.0 task with sparsity and quantization.\n",
    "The sparse weights are generated by finetuning with INT8 Quantization Aware Training recipe. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "181f2505-5cf2-4ccf-9bc6-a08290b41578",
   "metadata": {},
   "outputs": [],
   "source": [
    "!bash ../scripts/download_model.sh 384 # BERT-large model checkpoint\n",
    "!bash ../scripts/download_model.sh pyt megatron-large int8-qat sparse # Megatron-LM model weights"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5fa4218b",
   "metadata": {},
   "source": [
    "### Install extra dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8c51b43",
   "metadata": {},
   "outputs": [],
   "source": [
    "!cd /tmp && git clone https://github.com/vinhngx/transformers && cd transformers && pip install .\n",
    "!pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio===0.8.1 -f https://download.pytorch.org/whl/torch_stable.html"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d879fba2-1bfe-45d9-9fda-c816f545ca87",
   "metadata": {},
   "source": [
    "<a id=\"2\"></a>\n",
    "\n",
    "## 2. Building an INT8-Sparsity TensorRT optimized BERT model\n",
    "\n",
    "In this section, we will be optimizing the BERT model for inference with TRT using INT8 precision while leveraging structured sparsity.\n",
    "\n",
    "\n",
    "## What Is Sparsity in AI?\n",
    "The brain cannot be fully connected: 10^11 nerve cells, but only upto 10^4 connections each.\n",
    "In AI inference and machine learning, sparsity refers to a matrix of numbers that includes many zeros or values that will not significantly impact a calculation.\n",
    "\n",
    "### Fine-grained structured sparsity\n",
    "NVIDIA Ampere GPU architecture introduces the concept of fine-grained structured sparsity. On the NVIDIA A100 GPU, the structure manifests as a 2:4 pattern: out of every four elements, at least two must be zero. This reduces the data footprint and bandwidth of one matrix multiply (also known as GEMM) operand by 2x and doubles throughput by skipping the computation of the zero values using new NVIDIA Sparse Tensor Cores.\n",
    "\n",
    "<img src=\"structured_spare_matrix.jpg\" style=\"width: 200px;\"/>\n",
    "\n",
    "Fine-grained structured sparsity results in even load balancing, regular memory accesses, and 2x math efficiency with no loss in network accuracy.\n",
    "\n",
    "<img src=\"sparsity-diagram-600x338-r3.jpg\">\n",
    "\n",
    "### Training recipe\n",
    "[ASP](https://github.com/NVIDIA/apex/tree/master/apex/contrib/sparsity) (Automatic SParsity) is a tool that enables sparse training and inference for PyTorch models by adding 2 lines of Python.\n",
    "\n",
    "\n",
    "See our GTC session titled [Integer Quantization for DNN Inference Acceleration](https://developer.nvidia.com/gtc/2020/video/s22075-vid) for further info.\n",
    "\n",
    "\n",
    "In the following code, we will be downloading a pretrained BERT model from NGC that has been trained with QAT and ASP. This model is ready to be used in TRT INT8 mode while leveraging structured sparsity features on NVIDIA A100 GPUs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2599b25d-cbaf-4544-bb9d-64e378febccc",
   "metadata": {},
   "outputs": [],
   "source": [
    "import tensorrt as trt;\n",
    "TRT_VERSION = trt.__version__\n",
    "\n",
    "print(\"TensorRT version: {}\".format(TRT_VERSION))\n",
    "!mkdir -p engines_$TRT_VERSION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1a13df18-2b61-48ba-a1e8-f047100a209a",
   "metadata": {},
   "outputs": [],
   "source": [
    "BATCH_SIZE = 128"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "52e7eddc-9536-44b8-9b4b-5c70ca040fc2",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "!export CKPT_PATH=models/fine-tuned/bert_pyt_statedict_megatron_sparse_int8qat_v21.03.0/bert_pyt_statedict_megatron_sparse_int8_qat\n",
    "!python3 ../builder_varseqlen.py -w 40000 -c models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1 -b $BATCH_SIZE -s 384 -o engines_$TRT_VERSION/megatron_large_seqlen384_int8qat_sparse.engine --fp16 --int8 --strict -il --megatron --pickle models/fine-tuned/bert_pyt_statedict_megatron_sparse_int8qat_v21.03.0/bert_pyt_statedict_megatron_sparse_int8_qat -v models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1/vocab.txt -sp\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c944882f-d6ce-4719-8824-fb0b6fe682a2",
   "metadata": {},
   "source": [
    "<a id=\"3\"></a>\n",
    "## 3. Running inference examples\n",
    "\n",
    "Now that we've got a TensorRT engine, the inference workflow using the optimized network is as follows:\n",
    "\n",
    " -   Start the TensorRT runtime with this engine.\n",
    " -   Feed a passage and a question to the TensorRT runtime and receive as output the answer predicted by the network.\n",
    "\n",
    "<img src=\"Figure-2-workflow-to-perform-inference-with-trt.png\">\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "be366873-b1d2-4b8f-9723-04f419a63263",
   "metadata": {},
   "outputs": [],
   "source": [
    "PASSAGE = 'TensorRT is a high performance deep learning inference platform that delivers low latency and high throughput for apps'\\\n",
    "'such as recommenders, speech and image/video on NVIDIA GPUs. It includes parsers to import models, and plugins to support novel ops'\\\n",
    "'and layers before applying optimizations for inference. Today NVIDIA is open-sourcing parsers and plugins in TensorRT so that the deep'\\\n",
    "'learning community can customize and extend these components to take advantage of powerful TensorRT optimizations for your apps.'\n",
    "QUESTION=\"What is TensorRT?\"\n",
    "\n",
    "!python3 ../inference_varseqlen.py -e engines_$TRT_VERSION/megatron_large_seqlen384_int8qat_sparse.engine  -p $PASSAGE -q $QUESTION -v models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1/vocab.txt -s 256"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "edfc1170-5da4-4204-ab71-3e51e54391a5",
   "metadata": {},
   "outputs": [],
   "source": [
    "QUESTION=\"What is included in TensorRT?\"\n",
    "!python3 ../inference_varseqlen.py -e engines_$TRT_VERSION/megatron_large_seqlen384_int8qat_sparse.engine  -p $PASSAGE -q $QUESTION -v models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1/vocab.txt -s 256"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d7751aba",
   "metadata": {},
   "source": [
    "## Validation on the SQuAD dev set\n",
    "Next, we will assess the accuracy of the TensorRT-optimized INT8 BERT model on the SQuAD dev set. \n",
    "\n",
    "There are two dominant metrics used by many question answering datasets, including SQuAD: exact match (EM) and F1 score. These scores are computed on individual question+answer pairs. When multiple correct answers are possible for a given question, the maximum score over all possible correct answers is computed. Overall EM and F1 scores are computed for a model by averaging over the individual example scores.\n",
    "\n",
    "### Exact Match\n",
    "\n",
    "This metric is as simple as it sounds. For each question+answer pair, if the characters of the model's prediction exactly match the characters of (one of) the True Answer(s), EM = 1, otherwise EM = 0. This is a strict all-or-nothing metric; being off by a single character results in a score of 0. When assessing against a negative example, if the model predicts any text at all, it automatically receives a 0 for that example.\n",
    "\n",
    "### F1\n",
    "\n",
    "F1 score is a common metric for classification problems, and widely used in QA. It is appropriate when we care equally about precision and recall. In this case, it's computed over the individual words in the prediction against those in the True Answer. The number of shared words between the prediction and the truth is the basis of the F1 score: precision is the ratio of the number of shared words to the total number of words in the prediction, and recall is the ratio of the number of shared words to the total number of words in the ground truth.\n",
    "\n",
    "For more info, see [reference](https://qa.fastforwardlabs.com/no%20answer/null%20threshold/bert/distilbert/exact%20match/f1/robust%20predictions/2020/06/09/Evaluating_BERT_on_SQuAD.html#Metrics-for-QA).\n",
    "\n",
    "Herein, we verify that the TensorRT INT8 model maintains a state-of-the-art accuracy of 90% F1 score on the SQuAD development set, comparable to the TensorRT FP16 model as well as the original model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f729edf0-6b9d-4dfa-8656-3608b1b5fe60",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python3 ../inference_varseqlen.py -e engines_$TRT_VERSION/megatron_large_seqlen384_int8qat_sparse.engine -s 384 -sq ./squad/dev-v1.1.json -v models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_384_v19.03.1/vocab.txt -o ./predictions.json\n",
    "!python3 ../squad/evaluate-v1.1.py  squad/dev-v1.1.json  ./predictions.json 90\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "99df76c2-7f46-4918-8839-f5a059098af7",
   "metadata": {},
   "source": [
    "<a id=\"4\"></a>\n",
    "\n",
    "## 4. Inference benchmarking\n",
    "\n",
    "BERT can be applied both for online and offline use cases. Online NLU applications, such as conversational AI,  place tight latency budgets during inference. Several models need to execute in a sequence in response to a single user query. When used as a service, the total time a customer experiences includes compute time as well as input and output network latency. Longer times lead to a sluggish performance and a poor customer experience.\n",
    "\n",
    "While the exact latency available for a single model can vary by application, several real-time applications need the language model to execute in under 10 ms. Using a Tesla T4 GPU, BERT optimized with TensorRT can perform inference in 2.2 ms for a QA task similar to available in SQuAD with batch size =1 and sequence length = 128. Using the TensorRT optimized sample, you can execute up to a batch size of 8 for BERT-base and even higher batch sizes for models with fewer Transformer layers within the 10 ms latency budget.  It took 40 ms to execute the same task with highly optimized code on a CPU-only platform for batch size = 1, while higher batch sizes did not run to completion and exit with errors.\n",
    "\n",
    "<img src=\"./Figure-6-Compute-latency.jpg\">\n",
    "\n",
    "Next, we will perform a couple of inference benchmarks with different batch sizes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "21da1622-73c6-4123-bb3d-f33d5d0c4aff",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python3 ../perf_varseqlen.py -e ./engines_$TRT_VERSION/megatron_large_seqlen384_int8qat_sparse.engine -b 1 -s 384 -i 1000 -w 500"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e32384f3-2e69-4c4d-ab81-c33fea9de5d4",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python3 ../perf_varseqlen.py -e ./engines_$TRT_VERSION/megatron_large_seqlen384_int8qat_sparse.engine -b 64 -s 384 -i 1000 -w 500"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
